Implement and use CUDA graph plans #16548
Open
+315
−178
This PR implements the graph plan APIs for the CUDA backend, as well as code in ggml-backend.cpp to actually use the graph plan APIs when a backend supports them.
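For context, a minimal sketch of the graph plan lifecycle as exposed by the existing public ggml-backend API (create, compute, free); the wrapper function and its name are illustrative, and everything around graph construction and plan updates is elided:

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Minimal sketch, assuming a backend that supports graph plans.
// ggml_backend_graph_plan_create/compute/free are the existing public API.
static enum ggml_status run_with_plan(ggml_backend_t backend, struct ggml_cgraph * cgraph, int n_runs) {
    ggml_backend_graph_plan_t plan = ggml_backend_graph_plan_create(backend, cgraph);

    enum ggml_status status = GGML_STATUS_SUCCESS;
    for (int i = 0; i < n_runs && status == GGML_STATUS_SUCCESS; ++i) {
        // On the CUDA backend this replays a captured CUDA graph instead of
        // launching each kernel individually.
        status = ggml_backend_graph_plan_compute(backend, plan);
    }

    ggml_backend_graph_plan_free(backend, plan);
    return status;
}
```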
The main functional improvement is support for CUDA graphs when the graph is split (e.g. for hybrid inference). Currently the graph update and reuse logic (ggml_backend_sched_update_plans) is a simple heuristic: previous plans are only updated when the number of splits and their corresponding backends are the same as in the previous run, as sketched below. As the benchmarks show, this uniformly accelerates hybrid inference tg performance by up to 30%.
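To make the reuse condition concrete, here is a minimal sketch of the check, assuming a simplified per-split record; the struct and function names are illustrative, not the actual ggml_backend_sched_update_plans implementation:

```cpp
#include <cstddef>
#include <vector>
#include "ggml-backend.h"

// Illustrative only: one entry per split from the previous run.
struct prev_split_plan {
    ggml_backend_t            backend; // backend that executed the split
    ggml_backend_graph_plan_t plan;    // plan created for that split
};

// Plans from the previous run are updated in place only when the split count
// and the backend of every split are unchanged; otherwise the scheduler
// falls back to rebuilding the plans from scratch.
static bool plans_reusable(const std::vector<prev_split_plan> & prev,
                           const std::vector<ggml_backend_t>  & cur_backends) {
    if (prev.size() != cur_backends.size()) {
        return false; // split count changed
    }
    for (size_t i = 0; i < prev.size(); ++i) {
        if (prev[i].backend != cur_backends[i]) {
            return false; // backend assignment changed
        }
    }
    return true;
}
```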
The CUDA graph execution code is also refactored and cleaned up. Two of the three original graph plan failure paths are removed: disable_due_to_failed_graph_capture and disable_due_to_too_many_updates. The former is removed because I found no code that ever sets it to true. The latter is removed because I currently have no idea what its semantics should be in a split-graph scenario, and dropping it does not seem to degrade performance at all. Interestingly, I found that on my rig, even repeatedly building a graph and then executing it only once is always faster than launching the kernels individually. I suspect this is why performance improved in the tests even for CUDA-only workloads, which this PR's optimization does not target. This of course needs to be verified on more hardware configurations.

Performance comparison:
RTX 5090 + 13700k, 128GB 6400 MT/s RAM